Acknowledgements
This data was gathered by Jake Daniels. It covers data collected from the SEO tag on Medium.com between 2018-01-01 and 2018-12-31. Each record includes the title, date of publication, claps generated, author, reading time, and URL. Full text of the articles will be added in the final version.
There are 19,152 articles in this dataset.
Here’s what the data looks like:
Time-series
This is what the article volume looks like over time. This will be scaled beyond a one-year outlook in the final reports.
Post Volume
Aggregated by week. Look for seasonality.
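As a rough illustration of the weekly aggregation, here is a minimal Python sketch using only the standard library. The function name `weekly_post_counts`, the choice of Monday as the start of each week, and the sample dates are assumptions made for illustration; they are not taken from the dataset or the original analysis.

```python
from collections import Counter
from datetime import date, timedelta

def weekly_post_counts(publication_dates):
    """Count posts per week, keyed by the Monday that starts each week."""
    counts = Counter()
    for d in publication_dates:
        # weekday() returns 0 for Monday, so this rolls each date back to its Monday.
        week_start = d - timedelta(days=d.weekday())
        counts[week_start] += 1
    return dict(sorted(counts.items()))

# Hypothetical publication dates, not real rows from the dataset.
sample_dates = [date(2018, 1, 1), date(2018, 1, 3), date(2018, 1, 8)]
weekly = weekly_post_counts(sample_dates)
for week_start, n_posts in weekly.items():
    print(week_start.isoformat(), n_posts)
```

Plotting these counts against the week start dates is one way to look for seasonality.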
Weekly Topics
We can find the most relevant word of each week by using term frequency-inverse document frequency, or TF-IDF.
Let’s take those keywords a step further by looking at the phrases that also occurred that week and seeing how they relate to the keyword.
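To make the idea concrete, here is a minimal sketch of TF-IDF in plain Python. It assumes each week is treated as one "document" made of all that week's headline words, and it uses the basic `tf * ln(N / df)` weighting; the original analysis may have used a different tool or weighting. The function name `top_term_per_week` and the sample headlines are hypothetical.

```python
import math
from collections import Counter

def top_term_per_week(headlines_by_week):
    """Return the highest-scoring TF-IDF term for each week."""
    # One word-count "document" per week.
    counts = {
        week: Counter(" ".join(titles).lower().split())
        for week, titles in headlines_by_week.items()
    }
    n_weeks = len(counts)

    # Document frequency: in how many weeks does each term appear?
    doc_freq = Counter()
    for week_counts in counts.values():
        doc_freq.update(week_counts.keys())

    top_terms = {}
    for week, week_counts in counts.items():
        total = sum(week_counts.values())
        scores = {
            term: (n / total) * math.log(n_weeks / doc_freq[term])
            for term, n in week_counts.items()
        }
        top_terms[week] = max(scores, key=scores.get)
    return top_terms

# Hypothetical headlines, not taken from the dataset.
sample_weeks = {
    "2018-01-01": ["seo tips for beginners", "local seo tips"],
    "2018-01-08": ["seo for wordpress", "wordpress plugins for seo"],
    "2018-01-15": ["seo link building", "link audit guide"],
}
top_terms = top_term_per_week(sample_weeks)
print(top_terms)
```

Note that "seo" appears in every week, so its inverse document frequency is zero and it never wins, which is the behaviour we want for a word that describes the whole dataset rather than a single week.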
Below is a table of the three most relevant phrases along with their keyword.
You can sort by the highest clap averages and see which keyword and phrases contributed to that week being popular.
You can examine the geometric mean of claps that were generated (a measure of success) and the volume of posts (to judge whether the sample size is adequate).
Frequent Terms
Here are the words and phrases used most often in headlines. The phrases have been stemmed so that variants of the same word are grouped together.
Word and phrase counts give a good signal of what’s being discussed. However, we want to use data science to look further into the effectiveness of these words.
Word Networks
We can create networks of these relationships based on how often these words occur beside each other.
Below is a network of the correlated words in the article headers. Each grouping represents a topic.
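As a sketch of how the edges of such a network could be built, the snippet below counts how often two words sit next to each other in a headline. This is an assumption about the approach: it uses simple adjacency counts rather than a correlation measure, and the function name `adjacent_pair_counts`, the `min_count` threshold, and the sample headlines are hypothetical.

```python
from collections import Counter

def adjacent_pair_counts(headlines):
    """Count unordered pairs of words that appear side by side in a headline."""
    pairs = Counter()
    for title in headlines:
        words = title.lower().split()
        for left, right in zip(words, words[1:]):
            # Sort so that ("seo", "local") and ("local", "seo") are the same edge.
            pairs[tuple(sorted((left, right)))] += 1
    return pairs

# Hypothetical headlines, not taken from the dataset.
sample_headlines = ["local seo tips", "local seo guide", "seo tips"]
pair_counts = adjacent_pair_counts(sample_headlines)

# Keep only pairs seen at least min_count times; these become network edges.
min_count = 2
edges = {pair: n for pair, n in pair_counts.items() if n >= min_count}
print(edges)
```

Each surviving pair is an edge, each word is a node, and groups of connected nodes form the topic groupings.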
If we add another dimension, then we can see which of the networks are most effective for generating claps.
How about another dimension? The size of the circles now reflects the volume of that word.
Need help reading the final chart?
Positive Trends
- Red is good, especially when it’s a larger node
- Networks with red in them represent topics that are popular
- Small red nodes represent under-utilized topics
Negative Trends
- Blue is ineffective, especially when it’s a larger node
- Blue nodes MAY have topics that have yet to be packaged correctly
- White is neutral; these words/topics are performing at an average rate
Each connected node is part of a topic. We depend on the colours to distinguish which topics are good at generating claps and which are not.
Topic Clusters
The networks above show relationships between words that create topics.
We’ll try to find topics in the data by looking for clusters of words. We use unsupervised machine learning to do this: we give the computer no desired outcome to find, so it just digs for patterns that naturally occur in the dataset.
It’s a simple way to get a feel for big trends in the data and what’s currently underway in the industry.
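The text above does not name the clustering algorithm, so the following is only a sketch under an assumption: it uses a small, pure-Python k-means on bag-of-words vectors to show what "no desired outcome" looks like in practice. The helper names (`vectorize`, `kmeans`, `top_words`), the farthest-point initialisation, and the sample headlines are all hypothetical.

```python
from collections import Counter

def vectorize(headlines):
    """Turn each headline into a word-count vector over a shared vocabulary."""
    vocab = sorted({w for h in headlines for w in h.lower().split()})
    vectors = []
    for h in headlines:
        counts = Counter(h.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [vectors[0][:]]
    while len(centroids) < k:
        # Pick the vector farthest from every centroid chosen so far.
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(far[:])
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def top_words(vocab, centroid, n):
    """The n heaviest words of a centroid describe the cluster's topic."""
    ranked = sorted(zip(centroid, vocab), key=lambda p: (-p[0], p[1]))
    return [word for weight, word in ranked[:n] if weight > 0]

# Hypothetical headlines, not taken from the dataset.
sample_headlines = [
    "local seo tips",
    "wordpress blog plugins",
    "local seo guide",
    "wordpress blog themes",
]
vocab, vectors = vectorize(sample_headlines)
labels, centroids = kmeans(vectors, k=2)
topics = [top_words(vocab, c, n=2) for c in centroids]
print(labels, topics)
```

No labels are supplied anywhere; the two groups emerge only from which words appear together. The number of clusters (`k`) and the number of words per cluster (`n`) are the two settings to adjust.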
Here are 5 clusters, each with 8 words that describe the topic:
This creates our topic clusters! They are great for brainstorming content and knowing what’s commonly talked about. Let’s reduce the number of words and increase the number of clusters to see if new ideas emerge.
Tweaking the numbers can form different topics.
These topics can typically be inferred. It’s not too hard to figure out what each cluster represents.
- Topic 1: Web Design Services
- Topic 2: Reasons to use Social Media for your Business
- Topic 3: Wordpress Tips for Blogging
- Topic 4: Local SEO
- Topic 5: Digital Marketing Training
- Topic 6: Increase your website traffic
- Topic 7: Free Tools and Guide
- Topic 8: SEO Rankings
Word Impact
Here are words that are impactful or overused, along with words that are proven to bring claps.
The size is based on another measurement called the geometric mean, which is often used when data is highly skewed. It can be a tie-breaker for clusters that are close together.
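To show why the geometric mean suits skewed clap counts, here is a minimal sketch that computes it per headline word. One assumption is labelled loudly: because an article can have zero claps, the sketch uses a shifted geometric mean, `exp(mean(ln(1 + claps))) - 1`, and the text does not say whether the original analysis did the same. The function name `word_clap_stats` and the sample articles are hypothetical, and the words are not stemmed here.

```python
import math
from collections import defaultdict

def word_clap_stats(articles):
    """Return {word: (shifted geometric mean of claps, occurrences)}."""
    claps_by_word = defaultdict(list)
    for title, claps in articles:
        # set() counts each word once per headline.
        for word in set(title.lower().split()):
            claps_by_word[word].append(claps)

    stats = {}
    for word, claps in claps_by_word.items():
        # Assumption: shift by one so zero-clap articles do not break the logarithm.
        mean_log = sum(math.log1p(c) for c in claps) / len(claps)
        stats[word] = (math.expm1(mean_log), len(claps))
    return stats

# Hypothetical (headline, claps) pairs, not taken from the dataset.
sample_articles = [
    ("content tips", 0),
    ("content guide", 3),
    ("content strategy", 8),
    ("viral content", 1000),
]
stats = word_clap_stats(sample_articles)
geo_mean, occurrences = stats["content"]
arithmetic_mean = sum(claps for _, claps in sample_articles) / len(sample_articles)
print(round(geo_mean, 2), occurrences, arithmetic_mean)
```

In this sample, a single viral article drags the arithmetic mean far above what a typical article earns, while the geometric mean stays close to the typical values.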
We can also make these charts interactive, so anyone can inspect the points.
- Blue (OPPORTUNITY) - topics that perform well when they are written about… which isn’t often enough
- Gold (RELIABLE) - strong topics to write about
- Red (POOR TOPICS) - topics that are written about A LOT and do not generate many claps
- Dark Grey - average
And here’s a table of those terms. As in our chart above, more green (occurrences) adds credibility to the geometric mean (pink) being accurate.
| Word | Geometric Average | Occurrences |
|---|---|---|
| content | 1.66 | 773 |
| write | 1.59 | 265 |
| blog | 1.12 | 641 |
| googl | 1.09 | 1337 |
| tool | 1.09 | 441 |
| start | 1.07 | 184 |
| guid | 1.05 | 434 |
| creat | 0.94 | 230 |
| post | 0.91 | 206 |
| organ | 0.88 | 210 |
| traffic | 0.87 | 613 |
| site | 0.84 | 572 |
| step | 0.84 | 318 |
| increas | 0.78 | 321 |
| search | 0.78 | 1350 |
| rank | 0.77 | 795 |
| build | 0.76 | 417 |
| trend | 0.76 | 207 |
| boost | 0.75 | 297 |
| keyword | 0.75 | 408 |
| reason | 0.75 | 304 |
| strategi | 0.73 | 640 |
| wordpress | 0.73 | 324 |
| link | 0.72 | 398 |
| page | 0.72 | 553 |
- Pink and green are reliable
- Pink and white are opportunities
- White and green are poor
- White and white are average